Predicting gene expression level from DNA methylation profiles

The main approach is to predict gene expression from the mean methylation level of the corresponding promoter region. Our approach is to fita methylation profile for each promoter region and then predict gene expression from the shape of methylation profiles. This can be thought as extracting more features from DNA methylation data and using them to predict transcription abundance. However, this approach is computationally more expensive since for each promoter we need to maximize the following quantity:

\[ p(\mathbf{y}_{i}|\mathbf{f}_{i}) = \prod_{l=1}^{L} p(y_{il}|f_{il}) = \prod_{l=1}^{L} \binom{t_{il}}{m_{il}} \Phi(f_{il})^{m_{il}} (1 - \Phi(f_{il})\big)^{t_{il} - m_{il}} \]

The f are basis functions which are squashed through the probit transformation in order to lie in the [0, 1] interval. Currently, rbf basis functions are implemented.

After learning the methylation profiles for each region, we use a regression model to predict gene expression. The default is to use the SVM regression model, but other approaches are implemented such as Random Forests and Multivariate adaptive regression splines.


Some initial results on the ames mouse data

We model the methylation profiles using 9 RBF kernels (i.e. basis functions). The learned parameters for these basis functions will be our extracted features that we will use to predict gene expression levels. A region of 10kb is taken for the promoter regions, which are centred around the TSS, so 5kb upstream and 5kb downstream of TSS. The chromosomes chrLambda and chrM where discarded from further analysis. Reads with less than 4 coverage were also discarded and promoter regions should contains at least 15 CpGs so as to be considered for further analysis. The gene expression data are in FPKM. We log2 transform the FPKM values to reduce variation and to avoid the \(log2(0)\) issue, \(\alpha = 0.1\) was added to all the counts. Finally, we remove genes that have expression levels above 600, considering them as noise.

# Plot some learned methylation profiles for DF Young mouse
t = 1569
gene_name <- df_young_genes$gene_name[t]
# Methylation profile using 9 RBFs
plot_fitted_profiles(region = t, 
                     X = df_young_X, 
                     fit_prof = df_young_out_prof, 
                     fit_mean = df_young_out_mean, 
                     title = paste0("Gene ", df_young_genes$gene_name[t]))

# Plot some learned methylation profiles for DF Young mouse
t = 1533
gene_name <- df_young_genes$gene_name[t]
# Methylation profile using 9 RBFs
plot_fitted_profiles(region = t, 
                     X = df_young_X, 
                     fit_prof = df_young_out_prof, 
                     fit_mean = df_young_out_mean, 
                     title = paste0("Gene ", df_young_genes$gene_name[t]))


Methylation profiles for the Dip2a gene

As we can observe the latent basis functions can capture the rich patterns present in the WGBS data. We argue that there is a functional role for the shape of methylation profiles.


Using the learned profile parameters as input features, we train an SVM regression model using 70% of the data as training data, and the rest 30% as test data. So the training data consist of 10 input features w and the corresponding gene expression level y for each gene promoter region. Below we can see how well we can predict gene expression levels from DNA methylation patterns in the test data. The x-axis corresponds to the measured gene expression and the y-axis to the predicted gene expression levels. R denotes the Pearson’s correlation coefficient.

DF Young Results

DF Old Results

N Young Results

N Old Results

Predicting gene expresion levels from different mouse models

Our next step is to ask if methylation patterns are global across different cell lines and different mouse models. That is, learning the methylation profiles (i.e. extract higher order methylation features) for one mouse model, e.g. DF Old, can we use them to predict gene expression levels for a different mouse model, e.g. N Old?

The process is the following:

If there are global proerties of methylation patterns, then we expect similar correlation values to what we found on the previous section.

Results

Confusion matrix of predictions from one mouse model to the other mouse models when using the methylation profiles and the mean methylation as input features in the SVM regression model.



Clustering methylation profiles

To cluster methylation profiles we consider a mixture modelling approach. We assume that the methylation profiles can be partitioned into at most K clusters, and each cluster \(k\) can be modelled separately using the binomial distributed probit regression function as our observation model, which we introduced in the beginning of this document.

To estimate the model parameters \(\mathbf{\Theta} = (\pi_{1}, \ldots , \pi_{k}, \mathbf{w}_{1}, \ldots, \mathbf{w}_{k})\), the Expectation Maximization (EM) algorithm is considered. EM is a general iterative algorithm for computing maximum likelihood estimates when there are missing or latent variables, as in the case of mixture models. EM alternates between inferring the latent values given the parameters (E-step), and optimizing the parameters given the filled in data (M-step). Using the proposed observation model, direct optimization of the likelihood w.r.t parameters \(\mathbf{w}_{k}\) is intractable, thus, we have to resort to numerical optimization strategies, such as Conjugate Gradient algorithm.


Clustering methylation profiles using K = 5

(A) In the above figure we show the five clustered methylation profiles over \(\pm 5\)kb promoter region w.r.t. TSS in the direction of transcription for the four ames-mouse models. Each methylation profile is modelled using four RBFs. Comparing the clustered profiles it is evident that there are five prototypical methylation shapes across the mouse models. (B) Boxplots with the corresponding expression levels of the protein-coding genes assigned to each cluster for each of the four mouse models. The colours match with the clustered methylation profiles shown above. The numbers below each boxplot correspond to the total number of genes asssigned to each cluster.

We can observe that the methylation profiles across Young and Old mouse models are somewhat different (e.g. observe the red and orange profiles). Although, it should be noted that we observe changes in the methylation profiles when we change the promoter region. ***

Clustering purity

Using the above clustered methylation profiles we consider the purity of clusteirng across different cell lines, i.e., which fraction of genes assigned to a certain cluster in a certain cell lines are assigned to the same cluster in the other cell lines. To do this we create the Venn diagrams shown below, where each Venn diagram corresponds to a specific cluster.

# Cluster 1
knitr::include_graphics("../figures/final_comm_cluster_1_5000_5_4.png")

# Cluster 2
knitr::include_graphics("../figures/final_comm_cluster_2_5000_5_4.png")

# Cluster 3
knitr::include_graphics("../figures/final_comm_cluster_3_5000_5_4.png")

# Cluster 4
knitr::include_graphics("../figures/final_comm_cluster_4_5000_5_4.png")

# Cluster 5
knitr::include_graphics("../figures/final_comm_cluster_5_5000_5_4.png")


Identifying DMRs using M3D

In order to identify possible Differentially Methylated Regions (DMRs) across mouse models, we used the M3D statistic which takes into account the shape of the methylation profile to provide a more powerful statistical test for DMRs. Since the coverage of the WGBS data is low, we pooled 2 replicates together, thus we ended up in having 2 replicates for each cell line, instead of 4. We should note that the M3D needs 2 replicates from each group that will test for DMRs in order to generate the Null distribution, which will be used in order to compare the between group different when performing the statistical test.

As regions of interest we took again the promoter regions and run the M3D model for all possible combinations of mouse models. The ‘called’ DMRs (i.e. promoter regions) are shown below using the Benjamini-Hochberg adjusted \(p-values < 0.05\) for defining statistically significant changes. As we observe only a few regions were called, implying that the methylation profiles remain the same across all mouse models for most of the promoter regions.

Sample plots showing DMR profiles

List of called DMRs

# Gene names of DO-NO DMRs
print(diff_genes_do_no$gene_name)
##  [1] "Rgs20"         "C4bp"          "Gm17664"       "Serpina1e"    
##  [5] "Serpina11"     "Serpina3k"     "Akr1c6"        "Cmah"         
##  [9] "C9"            "Ugt3a1"        "4930548G14Rik" "Cyp2d10"      
## [13] "Kng2"          "Igfals"        "Cyp2c67"       "Adh6-ps1"     
## [17] "Adh4"          "Grhpr"         "Mup7"          "Mup1"         
## [21] "Mup10"         "Mup12"         "Mup19"         "Mup3"         
## [25] "Ugt2b5"        "Slc26a1"       "Fabp1"         "Pzp"          
## [29] "Slco1a1"       "Amn1"          "Cyp2b13"       "Cyp2b9"       
## [33] "Mrgpra2b"      "Mrgprb4"       "Slc10a2"       "Ces3a"        
## [37] "Ces3b"         "Nudt7"         "A230050P20Rik" "Glyctk"
# Gene names of DY-DO DMRs
print(diff_genes_dy_do$gene_name)
## [1] "Rgs20" "Thrsp"
# Gene names of DY-NY DMRs
print(diff_genes_dy_ny$gene_name)
##  [1] "Rgs20"         "5330438I03Rik" "Gm17664"       "Serpina11"    
##  [5] "Serpina3k"     "Cmah"          "C3"            "Thrsp"        
##  [9] "Ces3a"         "Nudt7"
# Gene names of NO-NY DMRs
print(diff_genes_no_ny$gene_name)
## [1] "Rgs20"   "Igfals"  "Cyp2c67"